Data cleaning for Antirrhinum majus data set from 2012

Tom Ellis, June 2017

In this notebook we will examine an empirical dataset using the snapdragon Antirrhinum majus.

In 2012 we collected open-pollinated seed capsules from wild mothers and genotypes samples of the offsping. A single seed capsule contains up to several hundred seeds from between 1 and lots of pollen donors. We also collected tissue and GPS positions for as many of the adults reproductive plants as we could find.

These data are those desribed and analysed by Ellis et al. (2018), and are available from the IST Austria data repository (DOI:10.15479/AT:ISTA:95).

Below, we will do an initial data inspection to weed out dubious loci and individuals. It can be argued that this process was overly conservative, and we threw out a lot of useful data, so you need not necessarily be so critical of your own data.


In [93]:
import numpy as np
from faps import *
import matplotlib.pyplot as plt
%pylab inline


Populating the interactive namespace from numpy and matplotlib

Data inspection

Import genotype data for the reproductive adults and offspring. The latter includes information on the ID of the maternal mother.


In [94]:
progeny = read_genotypes('../manuscript_faps/data_files/offspring_SNPs_2012.csv', mothers_col=1, genotype_col=2)
adults  = read_genotypes('../manuscript_faps/data_files/parents_SNPs_2012.csv')

iix = [i in adults.names for i in progeny.mothers.tolist()]
progeny = progeny.subset(iix)

GPS data

We will also import data GPS data for 2219 individuals tagged as alive in 2012. Since not all of these have been genotyped, we reformat the data to match the genotype data.


In [95]:
gps_pos = np.genfromtxt('../manuscript_faps/data_files/amajus_GPS_2012.csv', delimiter=',', skip_header=1, usecols=[3,4]) # import CSV file
gps_lab = np.genfromtxt('../manuscript_faps/data_files/amajus_GPS_2012.csv', delimiter=',', skip_header=1, usecols=0, dtype='str') # import CSV file
# subset GPS data to match the genotype data.
ix = [i for i in range(len(gps_lab)) if gps_lab[i] in adults.names]
gps_pos, gps_lab = gps_pos[ix], gps_lab[ix]

17 individuals are actually from about 15km to the East. It is hard to imagine that these individuals could contribute to the pollen pool of the mothers, so let's remove these from the sample.


In [96]:
plt.scatter(gps_pos[:,0], gps_pos[:,1])


Out[96]:
<matplotlib.collections.PathCollection at 0x7f610159f290>

In [97]:
ix = [i for i in range(len(gps_lab)) if gps_pos[i,0] > -5000]
gps_pos, gps_lab = gps_pos[ix], gps_lab[ix]

Genotype information

As a sanity check, confirm that the marker names really do match.


In [98]:
all([progeny.markers[i] == adults.markers[i] for i in range(progeny.nloci)])


Out[98]:
True

Tissue from the adults and progeny was dried in different ways. For the progeny, I didnt use enough silica gel to dry the tissue rapidly, and the DNA became degraded. Reflecting this, although genotype dropouts (the rate at which genotype information at a single locus fails to amplify) is respectable for the adults, but dire for the offspring.


In [99]:
print adults.missing_data().max()
print progeny.missing_data().max()


0.027972027972027972
0.7916666666666666

Luckily a lot of this is driven by a small number of loci/individuals with very high dropout rates.


In [100]:
fig = plt.figure(figsize=(10,10))
fig.subplots_adjust(wspace=0.2, hspace=0.2)

mdo = fig.add_subplot(2,2,1)
mdo.hist(progeny.missing_data('marker'), bins=np.arange(0, 1, 0.05))
mdo.set_xlabel("Missing data")
mdo.set_ylabel("Number of loci")
mdo.set_title('Per locus: offspring')

indo = fig.add_subplot(2,2,2)
indo.hist(progeny.missing_data(by='individual'), bins=np.arange(0, 1, 0.05))
indo.set_xlabel("Missing data")
indo.set_ylabel("Number of loci")
indo.set_title('Per indiviudual: offspring')

mda = fig.add_subplot(2,2,3)
mda.hist(adults.missing_data('marker'), bins=np.arange(0, 1, 0.05))
mda.set_xlabel("Missing data")
mda.set_ylabel("Number of loci")
mda.set_title('Per locus: adults')

inda = fig.add_subplot(2,2,4)
inda.hist(adults.missing_data(by='individual'), bins=np.arange(0, 1, 0.05))
inda.set_xlabel("Missing data")
inda.set_ylabel("Number of loci")
inda.set_title('Per indiviudual: adults')


Out[100]:
Text(0.5,1,u'Per indiviudual: adults')

Although overall per locus drop-out rates are low for the adults, there are some individuals with alarmingly high amounts of missing data. Candidates with very few loci typed can come out as being highly compatible with many offspring, just because there is insufficient information to exclude them.


In [101]:
print adults.missing_data(by='individual').max()
print progeny.missing_data('individual').max()


0.8985507246376812
0.9710144927536232

Count, then remove individuals with >5% missing data.


In [102]:
print "Offspring:", len(np.array(progeny.names)[progeny.missing_data(1) > 0.05])
print "Parents:", len(np.array(adults.names)[adults.missing_data(1) > 0.05])

progeny = progeny.subset(    individuals= progeny.missing_data(1) < 0.05)
adults  = adults.subset(individuals= adults.missing_data(1) < 0.05)


Offspring: 688
Parents: 66

Histograms look much better. It would still worth removing some of the dubious loci with high drop-out rates though.


In [103]:
fig = plt.figure(figsize=(10,10))
fig.subplots_adjust(wspace=0.2, hspace=0.2)

mdo = fig.add_subplot(2,2,1)
mdo.hist(progeny.missing_data('marker'), bins=np.arange(0, 0.7, 0.05))
mdo.set_xlabel("Missing data")
mdo.set_ylabel("Number of loci")
mdo.set_title('Per locus: offspring')

indo = fig.add_subplot(2,2,2)
indo.hist(progeny.missing_data(by='individual'), bins=np.arange(0, 0.7, 0.05))
indo.set_xlabel("Missing data")
indo.set_ylabel("Number of loci")
indo.set_title('Per indiviudual: offspring')

mda = fig.add_subplot(2,2,3)
mda.hist(adults.missing_data('marker'), bins=np.arange(0, 0.7, 0.05))
mda.set_xlabel("Missing data")
mda.set_ylabel("Number of loci")
mda.set_title('Per locus: adults')

inda = fig.add_subplot(2,2,4)
inda.hist(adults.missing_data(by='individual'), bins=np.arange(0, 0.7, 0.05))
inda.set_xlabel("Missing data")
inda.set_ylabel("Number of loci")
inda.set_title('Per indiviudual: adults')


Out[103]:
Text(0.5,1,u'Per indiviudual: adults')

Remove the loci with dropouts >10% from both the offspring and adult datasets.


In [104]:
print np.array(progeny.markers)[progeny.missing_data(0) >= 0.1]

progeny= progeny.subset(loci= progeny.missing_data(0) < 0.1)
adults = adults.subset(loci = progeny.missing_data(0) < 0.1)


['s217_2722063']

Allele frequency and heterozygosity generally show the convex pattern one would expect. An exception is the locus with allele frequency at around 0.4, but heterozygosity >0.7, which is suspect, and indicative of a possible outlier.


In [105]:
plt.scatter(adults.allele_freqs(), adults.heterozygosity(0))
plt.xlabel('Allele frequency')
plt.ylabel('Heterozygosity')
plt.show()


Loci with low heterozygosity are not dangerous in themselves; they might contribute some information, albeit little. To be on the safe side, let's remove loci with less than 0.2 heterozygosity, and the errant locus with high heterozygosity.


In [106]:
print "Heterozygosity > 0.7:", adults.markers[adults.heterozygosity(0) >0.7]
print "Heterozygosity < 0.2:", progeny.markers[adults.heterozygosity(0) < 0.2]

progeny = progeny.subset(loci= (adults.heterozygosity(0) > 0.2) * (adults.heterozygosity(0) < 0.7))
adults  = adults.subset( loci= (adults.heterozygosity(0) > 0.2) * (adults.heterozygosity(0) < 0.7))


Heterozygosity > 0.7: ['s217_2722063']
Heterozygosity < 0.2: ['s154_504353' 's320_60828' 's316_93292']

Summary of genotype data

This leaves us with a dataset of 61 loci for which allele frequency and heterozygosity are highest around 0.5, which is what we would like. In particular, heterozygosity (and hence homozygosity) among the adults is humped around 0.5, which is a good sign that parents should be readily distinguishable. There is nevertheless substantial spread towards zero and one for the progeny data however, which is less than ideal.


In [107]:
fig = plt.figure(figsize=(10,10))
fig.subplots_adjust(wspace=0.1, hspace=0.2)

afp = fig.add_subplot(2,2,1)
afp.hist(adults.allele_freqs())
afp.set_title('Adults')
afp.set_xlabel("Allele frequency")

afo = fig.add_subplot(2,2,2)
afo.hist(progeny.allele_freqs())
afo.set_title('Offspring')
afo.set_xlabel("Allele frequency")

hetp = fig.add_subplot(2,2,3)
hetp.hist(adults.heterozygosity(0))
hetp.set_xlabel("Heterozygosity")

heto = fig.add_subplot(2,2,4)
heto.hist(progeny.heterozygosity(0))
heto.set_xlabel("Heterozygosity")


Out[107]:
Text(0.5,0,u'Heterozygosity')

The effective number of loci can be seen as the number of loci at which one can make compare the offspring, maternal and candidate paternal genotype (i.e. those loci with no missing data). Given how high dropouts are in the offspring, it is worthwhile to check the effective number of loci for this dataset.

In fact, effective number of loci is good. The minimum number of valid loci to compare is 46, and in 99% of cases there are 57 or more loci.


In [108]:
np.unique([progeny.mothers[i] for i in range(progeny.size) if progeny.mothers[i] not in adults.names])
ix = [i for i in range(progeny.size) if progeny.mothers[i] in adults.names]
progeny = progeny.subset(ix)

In [109]:
mothers = adults.subset(progeny.parent_index('m', adults.names))
neloci  = effective_nloci(progeny, mothers, adults)
plt.hist(neloci.flatten(), bins=np.arange(45.5,63.5,1))
plt.show()


Finally, print some summary statistics about the quality of the genotype information in the data set.


In [110]:
print(adults.nloci)
print progeny.missing_data(0).mean()
print adults.missing_data(0).mean()
print adults.heterozygosity(0).min(), adults.heterozygosity(0).max()
print adults.allele_freqs().min(), adults.allele_freqs().max()


64
0.01730390401146132
0.007726070226070227
0.2000962000962001 0.5483405483405484
0.2308624031007752 0.8760975609756098

Example family: L1872

The progeny dataset consists of offspring from multiple families that were genotyped at the same time. It was convenient to consider them as one so far to tidy up the genotype data, but for subsequent analysis we need to split them up into their constituent full sib families. This is easy to do with split, which returns a list of genotypeArray objects.


In [111]:
prlist = progeny.split(progeny.mothers)
len(prlist)


Out[111]:
57

By way of a sanity check we will examine one of the largest families in detail. After the data filtering above, there are 20 offspring from mother L1872. Distributions of missing data, heterozygosity and allele frequency at each locus suggest no reason for alarm.


In [112]:
ex_progeny = prlist[2]
ex_mother  = adults.subset(ex_progeny.parent_index('m', adults.names))

ex_progeny.size


Out[112]:
20

Family structure

Cluster the family into sibships. I have set the proportion of missing parents to 0.1; we have removed 140 of the 2219 (6%) candidates logged as alive in 2012, and I allow for 10% of candidates having been missed. In fact the results do not depend on the parameter unless it is unrealistically high.


In [113]:
allele_freqs = adults.allele_freqs() # population allele frequencies
ex_patlik    = paternity_array(ex_progeny, ex_mother, adults, 0.0015, missing_parents=0.1)
ex_sc        = sibship_clustering(ex_patlik, 1000)

We can first look at the dendrogram of relatedness between individuals derived from the array of paternity likleihoods.


In [114]:
from scipy.cluster.hierarchy import dendrogram
dendrogram(ex_sc.linkage_matrix, orientation='left', color_threshold=0,
           above_threshold_color='black')
plt.show()


We can compare this to the most-likely partition structure to get a rough idea of what as going on. This partition groups offspring into 7 full sibships and has a posterior probability of 0.8. The partition structure simply labels individuals 0 to 20 with a unique, arbitrary identifier. For example, individuals 2 and 3 are grouped into an especially large family labelled '1'.


In [115]:
print "most-likely partition:", ex_sc.mlpartition
print "unique families:", np.unique(ex_sc.mlpartition)
print "partition probability:", np.exp(ex_sc.prob_partitions.max())


most-likely partition: [2 1 1 4 3 2 1 5 4 1 1 3 3 3 3 1 3 1 7 6]
unique families: [1 2 3 4 5 6 7]
partition probability: 0.5216846355100478

We can recover posterior probabilties of paternity for each candidate on each offspring using prob_paternity. For most offspring, there is a single candidate with a probability of paternity close to one.


In [116]:
postpat = ex_sc.prob_paternity()

# names of most probable candidates
mx = np.array([np.where(postpat[i].max() == postpat[i])[0][0] for i in range(ex_progeny.size)])

from pandas import DataFrame as df
df([adults.names[mx], np.exp(postpat.max(1))]).T


Out[116]:
0 1
0 M0880 1
1 M0819 1
2 M0819 1
3 M0698 1
4 M0122 1
5 M0880 1
6 M0819 1
7 M0125 0.99994
8 M0698 1
9 M0819 1
10 M0819 1
11 M0122 1
12 M0122 0.788393
13 M0122 1
14 M0122 1
15 M0819 1
16 M0122 0.983386
17 M0819 1
18 M0107 1
19 M0854 1

Family sizes

Consistent with the results for many families (shown below), the posterior distributions for family size suggest many small families and a smaller number of larger families.


In [117]:
fig = plt.figure(figsize=(15,6))

nf = fig.add_subplot(1,2,1)
nf.plot(range(1,ex_progeny.size+1), ex_sc.nfamilies())
nf.set_xlabel('Number of families')
nf.set_ylabel('Probability')

fs = fig.add_subplot(1,2,2)
fs.plot(range(1,ex_progeny.size+1), ex_sc.family_size())
fs.set_xlabel('Family size')
plt.show()


Geographic positions

Intuitively, one would expect most pollen donors to be fairly close to the mother. Since the most probable partition had fairly strong support and identified a set of candidates with posterior probabilities close to one, it is reasonable to use these individuals to get an idea of where the pollen donors are to be found.


In [118]:
ix =[i for i in range(len(gps_lab)) if gps_lab[i] in adults.names[mx]]
gps_cands = gps_pos[ix]
gps_ex =  gps_pos[gps_lab == "L1872"].squeeze()

The map below shows the spatial positions of all individuals in the sample in green. Overlaid are the mother in red, and top candidates in blue. The likley candidates are indeed found close to the mother along the lower (southern-most) road, with two individuals on the upper (northern) road. This gives us no cause to doubt the validity of the paternity results.


In [119]:
second = np.sort(postpat, 1)[:, 1]
sx = np.array([np.where(second[i] == postpat[i])[0][0] for i in range(ex_progeny.size)])
gps_sec = gps_pos[np.unique(sx)]

In [120]:
fig = plt.figure(figsize=(16.9/2.54,6.75/2.54))
#plt.figure(figsize=(12.5,5)

plt.xlabel('East-West positition (m)')
plt.ylabel('North-South positition (m)')
plt.xlim(-2500,2000)
plt.scatter(gps_pos[:,0],  gps_pos[:,1], s=5, color='green', alpha=0.5)
plt.scatter(gps_sec[:,0],  gps_sec[:,1], color='gold')
plt.scatter(gps_cands[:,0],gps_cands[:,1], color='blue')
plt.scatter(gps_ex[0], gps_ex[1], color='red', s=40, edgecolors='black')
plt.show()


We can use these data to get a very rough dispersal kernal. Most pollen comes from within 50m of the maternal plant.


In [121]:
dists = np.sqrt((gps_ex[0] - gps_cands[:,0])**2 + (gps_ex[1] - gps_cands[:,1])**2)
print "Mean dispersal =",mean(dists)

plt.hist(dists, bins=np.arange(0,650,50))
plt.show()


Mean dispersal = 48.04159641841041

In contrast, the second-most-likely candidates are on average more than 800m from the maternal plant.


In [122]:
dists2 = np.sqrt((gps_ex[0] - gps_sec[:,0])**2 + (gps_ex[1] - gps_sec[:,1])**2)
print "Mean dispersal =",mean(dists2)


Mean dispersal = 907.2250748822763

Missing data in the candidates

When candidate fathers have substantial missing data, they can have apparently high likelihoods of paternity just because there are fewer opportunities to show an incompatibility.

I previously found that candidates with ~10% missing data tended to be assigned as the true father alarmingly frequently. Here, we have already excluded candidates with more than 5% missing data. As a sanity to check to ensure missing data is not a substantial problem here, we can check how often the most likely candidate has 0, 1, 2, 3 or 4 loci with no genotype information.

The histograms below show these distributions based on probabilities before and after sibship clustering. Top candidates have either one or zero missing data points. This is probably really a random draw from the pool of candidates, because these are the most common categories.


In [123]:
md = ex_sc.prob_paternity()
px = [np.where(md[i]                  == md[i].max())[0][0]                     for i in range(ex_progeny.size)]
lx = [np.where(ex_patlik.lik_array[i] == ex_patlik.lik_array[i][i].max())[0][0] for i in range(ex_progeny.size)]

fig= plt.figure(figsize=(16.9/2.54, 6/2.54))
fig.subplots_adjust(wspace=0.3)


lh= fig.add_subplot(1,2,1)
lh.bar(range(4), np.unique(adults.missing_data(1), return_counts=True)[1])
lh.set_title('All adults')
lh.set_xlabel('Number failed loci')
lh.set_ylabel('Number of individuals')
lh.set_xticks(np.arange(4)+0.4)
lh.set_xticklabels(np.arange(4))

ph= fig.add_subplot(1,2,2)
ph.bar(np.arange(2), np.unique(adults.missing_data(1)[px], return_counts=True)[1])
ph.set_title('Top candidates')
ph.set_xlabel('Number failed loci')
ph.set_xticks(np.arange(4)+0.4)
ph.set_xticklabels(np.arange(4))
plt.show()


Relatedness

Another explanation for splitting a full sibship into multiple smaller sibships would be that there is relatedness among candidates, and a relative of the true sire can sometimes have a higher likelihood of paternity than the true sire just by chance. If this is the case we would expect the most likely candidates to be more related to one another than we would expect if they were a random draw from the population.

First, calculate a matrix of pairwise relatedness between all individuals in the sample of candidates:


In [124]:
matches = [(adults.geno[:,:,i][np.newaxis] == adults.geno[:,:,j][:, np.newaxis]).mean(2) for i in [0,1] for j in [0,1]]
matches = np.array(matches)
matches = matches.mean(0)

These histograms show pairwise relatedness for all pairs of candidates in blue, and the most probable father of each individual after clustering in orange (I have excluded duplicate candidates). There is no reason to suppose the top candidates are anything other than a random draw.


In [125]:
ux = np.unique(px)
top_r = np.array([matches[ux[i],ux] for i in range(len(ux))])

# weight bars to ensure the histograms sum to one.
w1 = np.ones_like(matches[np.triu_indices(2079, 1)]) / float(len(matches[np.triu_indices(2079, 1)]))
w2 = np.ones_like(top_r[np.triu_indices(len(ux), 1)]) / float(len(top_r[np.triu_indices(len(ux), 1)]))

fig= plt.figure(figsize=(9/2.54, 8/2.54))
plt.hist(matches[np.triu_indices(2079, 1)],  histtype='step', bins=np.arange(0.3,0.7, 0.025), weights=w1)
plt.hist(top_r[np.triu_indices(len(ux), 1)], histtype='step', bins=np.arange(0.3,0.7, 0.025), weights=w2)
plt.xlabel('Relatedness')
plt.ylabel('Density')
plt.show()


Multiple families

The code becomes more challenging because we will need to perform operations on every element in this list. Luckily this is straightforward in Python if we use list comprehensions. For example, we can pull out and plot the number of offspring in each half-sibling array:


In [126]:
plt.hist([prlist[i].size for i in range(len(prlist))], bins=np.arange(0,25))
plt.show()


All of these families are samples from much larger half sib arrays, so comparing full-sibship sizes and number is even more difficult if there are different numbers of offspring. For this reason we can pick out only those families with 17 or more offspring.

This cell splits genotype data into maternal families of 17 or more offspring, then pick 17 offspring at random (there is no meaning in the order of individuals in the genotypeArray object, so taking the first 17 is tantamount to choosing at random). This leaves us with 18 familes of 17 offspring.


In [127]:
# split into maternal families
mlist  = mothers.split(progeny.mothers)
prlist = progeny.split(progeny.mothers)
# families with 20 or more offspring
prog17 = [prlist[i] for i in range(len(prlist)) if prlist[i].size >=17] 
mlist = [mlist[i] for i in range(len(prlist)) if prlist[i].size >=17]
# take the first 17 offspring
prog17 = [x.subset(range(17)) for x in prog17]
mlist  = [x.subset(range(17)) for x in mlist]

Calculate likelihoods of paternity for each family. This took 3 seconds on a 2010 Macbook Pro; your mileage may vary. In order to do so we also need population allele frequencies.


In [128]:
allele_freqs = adults.allele_freqs() # population allele frequencies

from time import time
t0=time()
patlik = paternity_array(prog17, mlist, adults, mu=0.0013)
print "Completed in {} seconds.".format(time() - t0)


Completed in 3.23130893707 seconds.

The next step is clustering each family into full sibships.


In [129]:
t1 = time()
sc = sibship_clustering(patlik)
print "Completed in {} seconds.".format(time() - t1)


Completed in 1.0766479969 seconds.

Calculate probability distributions for family size and number of families for each array.


In [136]:
nfamilies = [x.nfamilies() for x in sc]
nfamilies = np.array(nfamilies)

famsize = [x.family_size() for x in sc]
famsize = np.array(famsize)

Plots below show the probability distributions for the number and sizes of families. Grey bars show 95% credible intervals (see CDF plots below). Samples of 17 offspring are divided into between four and 16 full-sibling families consisting of between one and eight individuals. Most families seem to be small, with a smaller number of large families.


In [137]:
fig = plt.figure(figsize=(16.9/2.54, 6/2.54))
fig.subplots_adjust(wspace=0.3, hspace=0.1)

nf = fig.add_subplot(1,2,1)
nf.set_ylabel('Probability density')
nf.set_xlabel('Number of families')
nf.set_ylim(-0.005,0.2)
nf.set_xlim(0,18)
nf.bar(np.arange(0.5,17.5), nfamilies.sum(0)/nfamilies.sum(), color='1', width=1)
nf.bar(np.arange(3.5,16.5), (nfamilies.sum(0)/nfamilies.sum())[3:16], color='0.75', width=1)

fs = fig.add_subplot(1,2,2)
fs.set_xlabel('Family size')
#fs.set_ylabel('Probability density')
fs.set_ylim(-0.05,0.8)
fs.set_xlim(0,17)
fs.bar(np.arange(0.5,17.5), famsize.sum(0)/famsize.sum(), color='1', width=1)
fs.bar(np.arange(0.5,6.5), (famsize.sum(0)/famsize.sum())[:6], color='0.75', width=1)

plt.show()


Cumulative probability density plots demonstrate the credible intervals for family size and number.


In [138]:
fig = plt.figure(figsize=(15, 6))
fig.subplots_adjust(wspace=0.3, hspace=0.1)

nf = fig.add_subplot(1,2,1)
nf.set_ylabel('Cumulative density')
nf.set_xlabel('Number of families')
nf.set_xlim(0,20)
nf.set_ylim(0,1.05)
nf.plot(np.arange(1,18), np.cumsum(nfamilies.sum(0)/nfamilies.sum()))
nf.axhline(0.975, 0.05, 0.95, linestyle='dashed')
nf.axhline(0.025, 0.05, 0.95, linestyle='dashed')
nf.grid()

fs = fig.add_subplot(1,2,2)
fs.set_ylabel('Cumulative density')
fs.set_xlabel('Family size')
fs.set_xlim(0,21)
fs.set_ylim(0,1.05)
fs.plot(np.arange(1,18), np.cumsum(famsize.sum(0)/famsize.sum()))
fs.axhline(0.975, 0.05, 0.95, linestyle='dashed')
fs.axhline(0.025, 0.05, 0.95, linestyle='dashed')
fs.grid()